if (!require("devtools")) {
install.packages("devtools")
}
devtools::install_github("shbrief/PCAGenomicSignatures")
library(PCAGenomicSignatures)
Currently, you can download PCAGenomicSignatures from Google Cloud bucket using
PCAGenomicSignatures::getModel function. This model is built from top 20 PCs of
536 studies (containing 44,890 samples) containing 13,934 common genes from each
of 536 study’s top 90% varying genes based on thier study-level standard deviation.
There are two versions of this using different gene sets for GSEA-based annotation;
MSigDB C2 (C2) and three priors from PLIER package (PLIERpriors). In this
vignette, we are using the C2 annotated model.
wd <- getwd()
fpath <- file.path(wd, "PCAmodel_C2.rds")
if (!file.exists(fpath)) {getModel("C2", dir = wd)}
PCAmodel <- readRDS(file.path(wd, "PCAmodel_C2.rds"))
PCAmodel
#> class: PCAGenomicSignatures
#> dim: 13934 4764
#> metadata(6): cluster size ... MeSH_freq updateNote
#> assays(1): model
#> rownames(13934): CASKIN1 DDX3Y ... CTC-457E21.9 AC007966.1
#> rowData names(0):
#> colnames(4764): PCcluster1 PCcluster2 ... PCcluster4763 PCcluster4764
#> colData names(4): PCcluster studies silhouetteWidth gsea
#> trainingData(2): PCAsummary MeSH
#> trainingData names(536): DRP000987 SRP059172 ... SRP164913 SRP188526
Human B-cell expression dataset The human B-cell dataset (Gene Expression Omnibus series GSE2350) consists of 211 normal and tumor human B-cell phenotypes whose expression was profiled on Affymatrix HG-U95Av2 arrays, and it is contained in an ExpressionSet object with 6,249 features x 211 samples.
if (!"bcellViper" %in% rownames(installed.packages())) {
BiocManager::install("bcellViper")
}
suppressPackageStartupMessages(library(bcellViper))
data(bcellViper)
dset
#> ExpressionSet (storageMode: lockedEnvironment)
#> assayData: 6249 features, 211 samples
#> element names: exprs
#> protocolData: none
#> phenoData
#> sampleNames: GSM44075 GSM44078 ... GSM44302 (211 total)
#> varLabels: sampleID description detailed_description
#> varMetadata: labelDescription
#> featureData: none
#> experimentData: use 'experimentData(object)'
#> Annotation:
dataset <- exprs(dset) # genes in SYMBOL
You can provide your own expression dataset in any of these formats: simple matrix, ExpressionSet, or SummarizedExperiment. Just make sure that genes in rows are in a ‘symbol’ format.
heatmapTable outputs a two panel table: top panel represents average silhouette
width (avg.sw) and the bottom panel represents the validation score.
You can display the validation output in multiple ways. For example, if you specify
scoreCutoff argument of heatmapTable, any validation result above that score
will be shown. If you specify the number of top validation results (= n)
through num.out argument of heatmapTable, the output will be a n-columned
heatmap table. You can also use the average silhouette width (swCutoff), the
size of cluster (clsizecutoff), PC from dataset (whichPC).
Here, we print out top 5 validated PCclusters with > 0 average silhouette width.
val_all <- validate(dataset, PCAmodel)
heatmapTable(val_all, num.out = 5, swCutoff = 0)
Under the default condition, plotValidate plots all non single-element clusters’
validation results in a single graph, where x-axis represent average silhouette
width of the PCclusters (a quality control measure of the signature) and y-axis
represent validation score. We recommend users to focus on PCclusters with higher
validation score and use avgerage silhouette width as a secondary criteria.
plotValidate(val_all, interactive = FALSE)
plotValidate(val_all, interactive = TRUE, minClusterSize = 4)
You can hover each data point for more information:
- sw : the average silhouette width of the clutser
- score : the top validation score between 8 PCs of the dataset and the cluster
- cl_size : the size of the cluster, represented by the dot size
- cl_num : the PCcluster number. You need this index to find more information about the cluster.
- PC : Test dataset’s PC number that validates the given PCcluster. Because we used
top 8 PCs of the test dataset, there are 8 categories.
If you double-click the PC legend on the right, you will enter an individual display mode where you can add an additional group of data point by single-click.
You can draw a wordcloud with the enriched MeSH term of PCclusters that validate
your dataset. Besed on the heatmap table above, 1st-3rd PCclusters (2538, 1139, 884)
show high validation scores with positive average silhouette widths, so we draw
wordclouds of those PCclusters using drawWordcloud function. You need to provide
PCAmodel and the index of the PCcluster you are interested in.
Index of validated PCclusters can be easily collected using validatedSingatures
function, which outputs the validated index based on num.out, PC from dataset
(whichPC) or any *Cutoff arguments in a same way as heatmapTable.
validatedSig <- validatedSignatures(val_all, num.out = 3, swCutoff = 0)
validated_ind <- validatedSig[,"cl_num"]
set.seed(1)
drawWordcloud(PCAmodel, validated_ind[1])
drawWordcloud(PCAmodel, validated_ind[2])
drawWordcloud(PCAmodel, validated_ind[3])
Because the test dataset is human B-cell expression data, we tried the model annotated with blood-associated gene sets.
wd <- getwd()
fpath <- file.path(wd, "PCAmodel_PLIERpriors.rds")
if (!file.exists(fpath)) {getModel("PLIERpriors", dir = wd)}
PCAmodel <- readRDS(file.path(wd, "PCAmodel_PLIERpriors.rds"))
PCAmodel
#> class: PCAGenomicSignatures
#> dim: 13934 4764
#> metadata(6): cluster size ... MeSH_freq updateNote
#> assays(1): model
#> rownames(13934): CASKIN1 DDX3Y ... CTC-457E21.9 AC007966.1
#> rowData names(0):
#> colnames(4764): PCcluster1 PCcluster2 ... PCcluster4763 PCcluster4764
#> colData names(4): PCcluster studies silhouetteWidth gsea
#> trainingData(2): PCAsummary MeSH
#> trainingData names(536): DRP000987 SRP059172 ... SRP164913 SRP188526